Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Expose health info from Block Stream/Executor RPC #889

Merged
merged 12 commits into from
Jul 18, 2024

Conversation

morgsmccauley
Copy link
Collaborator

This PR exposes a health field on the Block Stream/Executor info, which can be accessed via RPC. The intent of this field is for Coordinator to monitor it, and then act accordingly. I wanted to raise this work first, so that the overall result is not too large.

Essentially, health contains only a single enum describing the "state" of the process, but this can be expanded over time as needed.

@morgsmccauley morgsmccauley requested a review from a team as a code owner July 17, 2024 09:33
#[derive(Clone)]
pub struct BlockStreamHealth {
pub processing_state: ProcessingState,
pub last_updated: SystemTime,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike Runner which pushes the information up, Block Streamer polls it. For this reason I've added a timestamp, so that we can determine whether the data is stale or not. Stale will probably result in restart.

@morgsmccauley morgsmccauley linked an issue Jul 17, 2024 that may be closed by this pull request
@@ -162,7 +162,7 @@ async function blockQueueConsumer (workerContext: WorkerContext): Promise<void>
});

const postRunSpan = tracer.startSpan('Delete redis message and shift queue', {}, context.active());
parentPort?.postMessage({ type: WorkerMessageType.STATUS, data: { status: IndexerStatus.RUNNING } });
parentPort?.postMessage({ type: WorkerMessageType.EXECUTION_STATE, data: { state: ExecutionState.RUNNING } });
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IndexerStatus is outward facing, and we need a bit more granularity for internal use, so opted to create a separate type.

Copy link
Collaborator

@darunrs darunrs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me! I'm definitely still confused about the states as they look and feel very distinct from any action items though. A state like INDEXER_BACKFILL or LAKE_BACKFILL make more direct sense to what the block stream is actually doing for example. But I imagine you're keeping the states vague on purpose to avoid having to manage them more frequently? I'd say from a clarity perspective I'd prefer more detailed state information provided there's distinct Coordinator behavior for them.

uint64 updated_at_timestamp_secs = 2;
}

enum ProcessingState {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you just want to represent any possible states of the service from the get go? I imagine scenarios for UNSPECIFIED and WAITING are vague right now. Specifically for WAITING, I'm confused what that refers to. Does Runner back pressure count as WAITING? If it does, then it could swing between WAITING and RUNNING repeatedly. Otherwise, I'm not sure. I'm also trying to think what action items Coordinator would intend to have when receiving each of these states.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you just want to represent any possible states of the service from the get go?

Yes essentially. While we only really need Stalled, it was pretty trivial to add these other states.

And yes, Waiting was created specifically for the back pressure case. Doesn't really serve any purpose now. But it could be beneficial to expose these as metrics from Coordinator so we can view what state each indexer is in 🤔

/// Represents the processing state of a block stream
#[derive(Clone)]
pub enum ProcessingState {
/// Block Stream is not currently active but can be started. Either has not been started or was
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A block stream which wasn't started wouldn't even return a state right? It wouldn't be present in the list of block streams.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and no. The RPC logic is to only add "started" BlockStreams. But it's still possible to create a BlockStream and not start it, and it would make sense to default to Running

@morgsmccauley
Copy link
Collaborator Author

I'd say from a clarity perspective I'd prefer more detailed state information provided there's distinct Coordinator behavior for them.

At this stage we don't have requirement for these more granular states. The goal here is to remove the need for manually restarting the Block Streamer/Runner services. To achieve that, we only need to know if the process has stalled.

We could perhaps expand this to have granular states, and then restart if the bitmap backfill abnormally finished, but I wouldn't say that's required in the short term.

Base automatically changed from refactor/dedicated-control-loops to main July 18, 2024 01:56
@morgsmccauley morgsmccauley merged commit 29bde3c into main Jul 18, 2024
7 checks passed
@morgsmccauley morgsmccauley deleted the feat/health-rpc branch July 18, 2024 03:11
This was referenced Jul 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Restart unhealthy block streams and executors
2 participants